# Auto reload function (if modified outside notebook)
%load_ext autoreload
%autoreload 2
# Plot precision
%config InlineBackend.figure_format = 'retina'
# Interactive cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Add css tags in Notebook
from IPython.display import display_html
import pandas as pd # Handle dataset
import numpy as np # Numerical arrays
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Statistical plots
import pydot # Export decision tree (image...)
from scipy import stats # Stats modules
from sklearn.ensemble import RandomForestClassifier # Random forest model
from sklearn.tree import DecisionTreeClassifier, export_graphviz # Decision tree model + export
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_curve, roc_curve, auc # Metric functions
# Load the parent folder
import os, sys
BASE_DIR = os.path.dirname(os.path.abspath(''))
sys.path.append(BASE_DIR)
sys.path.append(os.path.join(BASE_DIR, '..'))
from notebook_methods import *
feature_names = ['percentile1', 'percentile2', 'percentile3', 'percentile4', 'percentile5',
'persistence1', 'persistence2', 'persistence3', 'persistence4', 'persistence5',
'tree1', 'tree2', 'tree3']
This notebook does not emphasize the code itself, but rather the outputs collected from the different processing steps. For more code details, please refer to the decision_tree.py file; for the whole project, refer to this GitHub repo.
I. Context
II. Current situation
III. Methods
III.A. Random Forest
1. Number of estimators
2. Tree overview
3. Feature importance
III.B. Others
1. Decision tree
2. Logistic regression
IV. With the 5 most important features
IV.I. Feature importance
IV.II. Cutoff
IV.III. Precision - recall
V. Results
After processing all those images, the methods still fail on some pixels, which leads us to remove them.
=> The final training (respectively evaluation) dataset is composed of 196964 (resp. 59076) samples (pixels).
The dataset looks like this:
# Load training - evaluation dataset
training = pd.read_excel("../Data/results.xlsx", sheet_name="Training")
evaluation = pd.read_excel("../Data/results.xlsx", sheet_name="Evaluation")
# Creating feature - output dataset
x_train, y_train = split_feature_output(training, feature_names, show_time=False)
x_test, y_test = split_feature_output(evaluation, feature_names, show_time=False)
# Create whole dataset
data = pd.concat([training, evaluation])
data.head()
Let's look at the distribution of the different columns:
# Plot the distribution of cloud / not-cloud pixels per method
plot_methods_repartition(data, feature_names)
Correlation
The tree methods are the columns with the highest correlation to the cloud column, which suggests they are accurate methods. Note: a high correlation alone doesn't prove a method is accurate (all methods could predict the same wrong output).
# Plot size
plt.figure(figsize=(20,10))
data[["cloud"] + feature_names].corr().style.background_gradient(cmap='coolwarm').format(precision=2)
I used the RandomForestClassifier method from the sklearn package (doc). A lot of options are available. One important option is the number of trees used to build the random forest model: n_estimators. Generally, the bigger it is, the better the results. However, using a lot of estimators is resource consuming.
The following figure has 2 plots:
res_rand_forest = compute_accuracy_over_nb_estimators(x_train, x_test, y_train, y_test, show_plot=True, progress_bar=False)
Strangely enough, the best accuracy is reached for a small number of estimators. However, these variations are very small (they occur after the 3rd digit). The first fluctuations can be understood as noise, and this noise is reduced as the number of estimators increases.
print("20 best accuracies (sorted):\n", res_rand_forest.sort_values(by="accuracies", ascending=False).head(20))
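The `compute_accuracy_over_nb_estimators` helper above is project-specific. A minimal sketch of such a sweep, using synthetic data in place of the notebook's dataset, could look like this:

```python
# Sweep n_estimators and record test accuracy for each value
# (synthetic data stands in for the notebook's training/evaluation sets).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

records = []
for n in [1, 5, 10, 50, 100]:
    clf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    records.append({"n_estimators": n,
                    "accuracies": accuracy_score(y_te, clf.predict(X_te))})
res = pd.DataFrame(records)
print(res.sort_values(by="accuracies", ascending=False))
```

Plotting `res` then shows how quickly the accuracy plateaus as trees are added.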
In the rest of the report, we will assume that n_estimators=500. Here is the model:
## Create Random forest model
# Next results assumes n_estimators = 500
nb_estimators = 500
# Create result dataframe + compute feature accuracy
res_accuracy = pd.DataFrame({"Methods": feature_names,
"Accuracy": [accuracy_score(data.cloud, data[col]) for col in feature_names]})
# Create Decision Tree classifer object
model_rf = RandomForestClassifier(n_estimators=nb_estimators, random_state=0)
# Fit model
model_rf.fit(x_train, y_train)
The depths of the trees in the random forest model (for 500 estimators) are the following:
# Predict the response for test dataset
y_pred = model_rf.predict(x_test)
# Add the accuracy to the accuracy results
add_row(res_accuracy, ["randomForest", accuracy_score(y_test, y_pred)])
# Save the first tree of the forest as a png image
export_graphviz(model_rf.estimators_[0], out_file='tree_rf.dot',
feature_names=list(training.columns[5:]),
class_names=["Not Cloudy", "Cloudy"],
rounded=True, precision=1)
(graph, ) = pydot.graph_from_dot_file('tree_rf.dot')
graph.write_png('tree_rf.png')
lengths = np.array([estimator.tree_.max_depth for estimator in model_rf.estimators_])
print("Depth statistics of the trees in the randomForest model (n_estimators=%d):" % (nb_estimators))
print("\t- Min: %3d\n\t- Max: %3d\n\t- Mean: %3.2f" % (min(lengths), max(lengths), sum(lengths)/len(lengths)))
nodes = np.array([estimator.tree_.node_count for estimator in model_rf.estimators_])
print("\nNode count statistics of the trees in the randomForest model (n_estimators=%d):" % (nb_estimators))
print("\t- Min: %3d\n\t- Max: %3d\n\t- Mean: %3.2f" % (min(nodes), max(nodes), sum(nodes)/len(nodes)))
The first tree of the random forest looks like this:

The following graph shows the importance per feature. The 3 tree methods are the most important, which is consistent with the correlation matrix seen in section II.
tmp = pd.DataFrame({"Methods": feature_names,
"Importances": model_rf.feature_importances_}) \
.sort_values(by="Importances", ascending=False)
g = sns.barplot(x='Methods', y='Importances', data=tmp, )
_ = g.set_xticklabels(g.get_xticklabels(), rotation=45)
_ = plt.title("Importance per feature (Random forest model)")
plt.show()
The following graph shows the accuracy as the number of features used in the random forest decreases. At each iteration, the feature with the smallest importance is removed.
accuracy_over_nb_features = accuracy_over_number_of_feature(training, evaluation, feature_names, nb_estimators)
accuracy_over_nb_features \
.plot('Number_of_features', 'Accuracy', rot=0, title="Accuracy over number of features (Random forest model)") \
.invert_xaxis()
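The `accuracy_over_number_of_feature` helper is project-specific; the backward-elimination loop it implements can be sketched as follows (synthetic data, scikit-learn assumed):

```python
# Backward feature elimination: at each step, drop the remaining feature with
# the smallest importance, refit the forest, and record the test accuracy.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
features = list(range(X.shape[1]))  # indices of the remaining features

rows = []
while features:
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_tr[:, features], y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te[:, features]))
    rows.append({"Number_of_features": len(features), "Accuracy": acc})
    # Drop the least important remaining feature
    features.pop(int(np.argmin(clf.feature_importances_)))
result = pd.DataFrame(rows)
print(result)
```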
I have also tried to build two other models: a logistic regression model and a decision tree model.
I used the DecisionTreeClassifier method from the sklearn package to build the decision tree model. The tree is similar to the one built in the random forest model.
The decision tree is also deep:
# Create Decision Tree classifer object
tree = DecisionTreeClassifier()
# Train Decision Tree Classifer
tree = tree.fit(x_train, y_train)
# Predict the response for test dataset
y_pred = tree.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
add_row(res_accuracy, ["decisionTree", accuracy])
# Save the tree as a png image
export_graphviz(tree, out_file='decision_tree.dot',
feature_names=list(training.columns[5:]),
class_names=["Not Cloudy", "Cloudy"],
rounded=True, precision=1)
(graph, ) = pydot.graph_from_dot_file('decision_tree.dot')
graph.write_png('decision_tree.png')
print("Tree depth: ", tree.tree_.max_depth)
print("Number of nodes: ", tree.tree_.node_count)
The tree is similar to the ones from the random forest model:

tmp = dict(zip(feature_names, tree.feature_importances_))
tmp = {key:[value] for key, value in tmp.items()}
tmp = pd.DataFrame(tmp).T
tmp.columns = ["feature_importances"]
tmp = tmp.sort_values(by="feature_importances", ascending=False)
g = sns.barplot(x = tmp.index , y="feature_importances", data=tmp)
_ = g.set_xticklabels(g.get_xticklabels(), rotation=45)
_ = plt.title("Feature importance decision tree")
plt.show()
I use the LogisticRegression class from the sklearn package. The model can output a probability between 0 and 1; I then classify a pixel by comparing this probability to 0.5.
from sklearn.linear_model import LogisticRegression
# Create model
model = LogisticRegression(solver='liblinear')
# Fit model
model.fit(x_train, y_train)
# Predict the response for test dataset (probability output)
y_pred = model.predict_proba(x_test)[:, 1]
# Categorize the output
y_pred = np.where(y_pred > 0.5, 1, 0)
accuracy = accuracy_score(y_test, y_pred)
add_row(res_accuracy, ["logisticRegression", accuracy])
pass
In the next section we deal with only the 5 most important features found in the plot of III.A.3: "tree3", "tree2", "tree1", "percentile5", "percentile1".
tree2 and tree1 have the highest correlation with the features percentile5 and percentile1 (tree1 with a correlation of 0.58).
feature_names = ["tree3", "tree2", "tree1", "percentile5", "percentile1"]
data[feature_names].corr().style.background_gradient(cmap='coolwarm').format(precision=2)
The following plot shows the importance per feature, according to the column removed. We can see several points:
- The percentile1 method has very little influence in all cases (always less than 5%).
- The tree3 and tree2 methods always have the highest importance, far above the other features (always at least twice as much).
- The tree1 feature, according to the criterion defined just above, slightly reduces the precision (by 0.03%).
feat_importance = feature_importance_per_model(training, evaluation, feature_names, nb_estimators=500)
plot_feature_importance_per_model(feat_importance, feature_names)
In the following section, we only consider the following features: tree3, tree2, percentile5 and percentile1.
Another important parameter in classification is the cutoff threshold. It gives the opportunity to tune the trade-off between the prediction types (false positives, true positives...).
In our study, we are trying to predict whether a pixel is a cloud or not. We want to detect cloudy pixels, remove them, and then interpolate them. Because cloudy pixels are repaired in the next processing step, predicting a cloudy pixel as not cloud does more harm than repairing a non-cloudy pixel classed as cloudy.
In the following section, we will try to find a cutoff value that reduces the rate of False Negatives (cloudy pixels predicted as not cloud).
The following confusion matrices have the following structure: rows are the predictions, columns are the actual classes.
| Class | Cloud | Not cloud |
|---|---|---|
| Cloud | True positive | False positive |
| Not cloud | False negative | True negative |
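This transposed convention (rows = predictions) is obtained by transposing sklearn's confusion matrix and listing the cloud class (label 1) first. A toy example with made-up labels, just to illustrate the layout:

```python
# Toy example of the transposed confusion matrix convention used below:
# rows are predictions, columns are actual classes, cloud (1) listed first.
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0])  # actual classes
y_pred = np.array([1, 0, 0, 1, 0, 0, 0])  # predictions

names = ["Cloud", "Not cloud"]
cm = np.transpose(confusion_matrix(y_true, y_pred, labels=[1, 0]))
cm = pd.DataFrame(cm, columns=names, index=names)
print(cm)  # row "Cloud" = [TP, FP], row "Not cloud" = [FN, TN]
```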
feature_names = ["tree3", "tree2", "percentile5", "percentile1"]
nb_estimators = 500
# Subset the datasets
sub_training = training[["cloud"] + feature_names]
sub_evaluation = evaluation[["cloud"] + feature_names]
# Creating feature - output dataset
x_train, y_train = split_feature_output(sub_training, feature_names, show_time=False)
x_test, y_test = split_feature_output(sub_evaluation, feature_names, show_time=False)
# Create Decision Tree classifer object
model_rf = RandomForestClassifier(n_estimators=nb_estimators, random_state=0)
# Fit model
_ = model_rf.fit(x_train, y_train)
# Eval the model with proba output
y_pred = model_rf.predict(x_test)
y_score = model_rf.predict_proba(x_test)
# Draw ROC Curve
fpr, tpr, auc_thresholds = roc_curve(y_test, y_score[:, 1])
plotly_roc_curve(fpr, tpr, auc_thresholds)
Choosing a threshold equal to 0.2972 reduces the number of False Negatives. However, the number of False Negatives is still higher than the number of False Positives.
threshold = 0.2972
names = ["Cloud", "Not cloud"]
y_pred_29 = np.where(y_score[:,1] > threshold, 1, 0)
cm_29 = np.transpose(confusion_matrix(y_test, y_pred_29, labels=[1,0]))
cm_29 = pd.DataFrame(cm_29, columns=names, index=names)
# sum((y_pred_29 == 1) & (y_test == 0))
# sum((y_pred_29 == 1) & (y_test == 1))
# sum((y_pred_29 == 0) & (y_test == 1))
cm_50 = np.transpose(confusion_matrix(y_test, y_pred, labels=[1,0]))
cm_50 = pd.DataFrame(cm_50, columns=names, index=names)
# sum((y_pred == 1) & (y_test == 0))
# sum((y_pred == 1) & (y_test == 1))
# sum((y_pred == 0) & (y_test == 1))
accuracy = accuracy_score(y_test, y_pred_29)
res_accuracy = add_row(res_accuracy, ["randomForest V2", accuracy])
cm_50_styler = cm_50.style.set_table_attributes("style='display:inline; padding:10px;'").set_caption(
'<center>Confusion matrix <br>(threshold {:6.4f}):<center>'.format(0.5))
cm_29_styler = cm_29.style.set_table_attributes("style='display:inline; padding:10px;'").set_caption(
'<center>Confusion matrix <br>(threshold {:6.4f}):<center>'.format(threshold))
display_html(cm_50_styler._repr_html_() + cm_29_styler._repr_html_(), raw=True)
print("Accuracy {:6.4f}: ".format(0.5), accuracy_score(y_test, y_pred))
print("Accuracy {:6.4f}: ".format(threshold), accuracy_score(y_test, y_pred_29))
The precision and recall for a threshold equal to 0.2972 look good (green circle), as the following plot shows.
precision, recall, thresholds = precision_recall_curve(y_test, y_score[:, 1])
plotly_precision_recall_vs_threshold(precision, recall, thresholds, 0.2972)
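The `plotly_precision_recall_vs_threshold` helper is project-specific; to read off the precision and recall at a given threshold directly, one can find the nearest entry in the arrays returned by `precision_recall_curve` (synthetic scores here, for illustration):

```python
# Look up precision/recall at the threshold closest to a target value.
# precision_recall_curve returns arrays where precision[i] and recall[i]
# correspond to thresholds[i] (thresholds is one element shorter).
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
# Noisy scores correlated with the labels
y_score = np.clip(y_true * 0.5 + rng.random(200) * 0.6, 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
target = 0.2972
i = int(np.argmin(np.abs(thresholds - target)))
print("threshold=%.4f  precision=%.3f  recall=%.3f"
      % (thresholds[i], precision[i], recall[i]))
```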
The accuracy results of the different models:
plot_accuracy_over_methods(res_accuracy)
res_accuracy